10 (a) Let us use the Boston dataset


In [1]:
library(MASS)
head(Boston)
# ?Boston for information


crim     zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio   black  lstat  medv
0.00632  18   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3  396.90   4.98  24.0
0.02731   0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8  396.90   9.14  21.6
0.02729   0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03  34.7
0.03237   0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7  394.63   2.94  33.4
0.06905   0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7  396.90   5.33  36.2
0.02985   0   2.18     0  0.458  6.430  58.7  6.0622    3  222     18.7  394.12   5.21  28.7

How many rows and columns are in this dataset?


In [2]:
dim(Boston)


  1. 506
  2. 14

The Boston dataset has 506 rows and 14 columns (fields). Each row represents a suburb of Boston. Thirteen of the columns are properties of the suburb, and the remaining column, medv (the median value of owner-occupied homes), is the response variable that those properties help explain.
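As a complement to dim, here is a minimal sketch of two other base-R ways to inspect the same information. The variable names n_rows and n_cols are illustrative and not part of the original notebook.

```r
library(MASS)

n_rows <- nrow(Boston)  # number of suburbs (observations)
n_cols <- ncol(Boston)  # number of variables
cat("rows:", n_rows, "columns:", n_cols, "\n")

# Compact overview: one line per column with its type and first few values
str(Boston)
```

str is handy here because it also shows that chas and rad are stored as integers while the other columns are numeric.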


10 (b) Let us create some pairwise scatter plots


In [22]:
pairs(Boston)


Since there are 14 variables, the scatter plot matrix is nearly illegible. Instead, we will get a bird's-eye view of our data using a correlation matrix.


In [12]:
corr_matrix = cor(Boston, method="pearson") # Generate Correlation Matrix
corr_matrix


                crim           zn        indus         chas          nox           rm          age          dis          rad          tax      ptratio        black        lstat         medv
crim      1.00000000  -0.20046922   0.40658341 -0.055891582   0.42097171  -0.21924670   0.35273425  -0.37967009  0.625505145   0.58276431    0.2899456  -0.38506394    0.4556215   -0.3883046
zn       -0.20046922   1.00000000  -0.53382819 -0.042696719  -0.51660371   0.31199059  -0.56953734   0.66440822 -0.311947826  -0.31456332   -0.3916785   0.17552032   -0.4129946    0.3604453
indus     0.40658341  -0.53382819   1.00000000  0.062938027   0.76365145  -0.39167585   0.64477851  -0.70802699  0.595129275   0.72076018    0.3832476  -0.35697654    0.6037997   -0.4837252
chas     -0.05589158  -0.04269672   0.06293803  1.000000000   0.09120281   0.09125123   0.08651777  -0.09917578 -0.007368241  -0.03558652   -0.1215152   0.04878848   -0.0539293    0.1752602
nox       0.42097171  -0.51660371   0.76365145  0.091202807   1.00000000  -0.30218819   0.73147010  -0.76923011  0.611440563   0.66802320    0.1889327  -0.38005064    0.5908789   -0.4273208
rm       -0.21924670   0.31199059  -0.39167585  0.091251225  -0.30218819   1.00000000  -0.24026493   0.20524621 -0.209846668  -0.29204783   -0.3555015   0.12806864   -0.6138083    0.6953599
age       0.35273425  -0.56953734   0.64477851  0.086517774   0.73147010  -0.24026493   1.00000000  -0.74788054  0.456022452   0.50645559    0.2615150  -0.27353398    0.6023385   -0.3769546
dis      -0.37967009   0.66440822  -0.70802699 -0.099175780  -0.76923011   0.20524621  -0.74788054   1.00000000 -0.494587930  -0.53443158   -0.2324705   0.29151167   -0.4969958    0.2499287
rad       0.62550515  -0.31194783   0.59512927 -0.007368241   0.61144056  -0.20984667   0.45602245  -0.49458793  1.000000000   0.91022819    0.4647412  -0.44441282    0.4886763   -0.3816262
tax       0.58276431  -0.31456332   0.72076018 -0.035586518   0.66802320  -0.29204783   0.50645559  -0.53443158  0.910228189   1.00000000    0.4608530  -0.44180801    0.5439934   -0.4685359
ptratio   0.28994558  -0.39167855   0.38324756 -0.121515174   0.18893268  -0.35550149   0.26151501  -0.23247054  0.464741179   0.46085304    1.0000000  -0.17738330    0.3740443   -0.5077867
black    -0.38506394   0.17552032  -0.35697654  0.048788485  -0.38005064   0.12806864  -0.27353398   0.29151167 -0.444412816  -0.44180801   -0.1773833   1.00000000   -0.3660869    0.3334608
lstat     0.45562148  -0.41299457   0.60379972 -0.053929298   0.59087892  -0.61380827   0.60233853  -0.49699583  0.488676335   0.54399341    0.3740443  -0.36608690    1.0000000   -0.7376627
medv     -0.38830461   0.36044534  -0.48372516  0.175260177  -0.42732077   0.69535995  -0.37695457   0.24992873 -0.381626231  -0.46853593   -0.5077867   0.33346082   -0.7376627    1.0000000

The following observations were made:

  • crim is positively correlated with indus, age, nox, rad, tax and lstat, and negatively correlated with dis, black and medv.
  • zn is positively correlated with dis, and negatively correlated with indus, nox, age and lstat.
  • indus is positively correlated with nox, age, rad, tax and lstat, and negatively correlated with dis and medv.
  • nox is positively correlated with age, rad, tax and lstat, and negatively correlated with dis and medv.
  • rm is positively correlated with medv, and negatively correlated with lstat.
  • age is positively correlated with tax and lstat, and negatively correlated with dis.
  • dis is negatively correlated with tax, lstat and rad.
  • rad is positively correlated with tax, ptratio and lstat, and negatively correlated with black.
  • tax is positively correlated with ptratio and lstat, and negatively correlated with black and medv.
  • ptratio is negatively correlated with medv.
  • lstat is negatively correlated with medv.
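Reading 196 cells by eye is error-prone, so the strongest relationships can also be pulled out of the matrix programmatically. The sketch below is one possible approach, not the notebook's own method; the 0.7 cutoff and the names idx and strong are arbitrary choices made for illustration.

```r
library(MASS)

corr_matrix <- cor(Boston, method = "pearson")

# Keep each pair once (upper triangle) and only where |r| exceeds the cutoff
idx <- which(upper.tri(corr_matrix) & abs(corr_matrix) > 0.7, arr.ind = TRUE)

strong <- data.frame(
  var1 = rownames(corr_matrix)[idx[, "row"]],
  var2 = colnames(corr_matrix)[idx[, "col"]],
  r    = corr_matrix[idx]
)

# Strongest relationships first, regardless of sign
strong <- strong[order(-abs(strong$r)), ]
print(strong, row.names = FALSE)
```

With this cutoff the list is headed by rad and tax (r ≈ 0.91), the largest off-diagonal entry in the matrix above.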

10(c) From the correlation matrix and the scatter plots, we can make the following observations about the per capita crime rate (crim). These are associations, not causal claims.

  • The greater the proportion of non-retail business acres per town (indus), the greater the crime rate.
  • The greater the proportion of older owner-occupied units (age), the greater the crime rate.
  • The greater the nitrogen oxide concentration (nox), the greater the crime rate.
  • The greater the index of accessibility to radial highways (rad), the greater the crime rate.
  • The higher the property-tax rate (tax), the greater the crime rate.
  • The higher the percentage of lower-status population (lstat), the higher the crime rate.
  • The farther away the employment centers (dis), the lower the crime rate.
  • The higher the black index (1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town), the lower the crime rate.
  • The higher the median value of homes (medv), the lower the crime rate.
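To check how strong each of these associations is, the crim row of the correlation matrix can be ranked by absolute value. This is a minimal sketch; the names crim_cor and crim_ranked are illustrative.

```r
library(MASS)

crim_cor <- cor(Boston)["crim", ]                  # correlation of crim with every column
crim_cor <- crim_cor[names(crim_cor) != "crim"]    # drop the trivial self-correlation

# Order by strength of association, ignoring sign
crim_ranked <- crim_cor[order(abs(crim_cor), decreasing = TRUE)]
print(round(crim_ranked, 2))
```

The ranking starts with rad and tax and ends with chas, which is consistent with chas being left out of the list above.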

10 (d) The names of the suburbs are not given in the dataset, so suburbs with high crime rates, tax rates or pupil-teacher ratios can only be identified relative to the rest of the data, for example using a histogram. Let us determine the distribution of crime rates for the suburbs in our dataset.


In [24]:
hist(Boston$crim, breaks=20, xlab="Crime Rate", main="Histogram of Crime Rates")


It is clear that most of the suburbs in the sample have low crime rates. Now let's take a look at tax rates.


In [25]:
hist(Boston$tax, breaks=20, xlab="Tax Rate", main="Histogram of Tax Rates")


Many suburbs have tax rates up to about 440, and a second large group has tax rates around 680. Few suburbs have tax rates between these figures. Let us now take a look at pupil-teacher ratios.


In [26]:
hist(Boston$ptratio, breaks=20, xlab="Pupil Teacher Ratio", main="Histogram of Pupil Teacher Ratios")


The pupil-teacher ratios are spread fairly evenly across their range, except for a particularly tall bar around 20 to 20.5. Let us find out exactly how many suburbs fall in that interval.


In [27]:
length(Boston$ptratio[20 < Boston$ptratio & Boston$ptratio < 20.5])


145

So 145 of the 506 suburbs in our dataset have a pupil-teacher ratio between 20 and 20.5. Pretty interesting!
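The histograms can be backed up with a few numbers. The sketch below prints the range of each of the three variables and repeats the count using sum on a logical vector, which is a common idiom for counting rows that satisfy a condition. The names ranges and n_high are illustrative.

```r
library(MASS)

# Range of each variable examined above (rows: min, max)
ranges <- sapply(Boston[, c("crim", "tax", "ptratio")], range)
rownames(ranges) <- c("min", "max")
print(ranges)

# TRUE counts as 1, so sum() gives the number of matching suburbs
n_high <- sum(Boston$ptratio > 20 & Boston$ptratio < 20.5)
print(n_high)

# Which distinct ratios make up that interval, and how often each occurs
print(table(Boston$ptratio[Boston$ptratio > 20 & Boston$ptratio < 20.5]))
```

The final table shows whether the tall bar comes from one repeated value or from several nearby ones.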


10 (e) Let us determine the number of suburbs that bound the Charles River.


In [28]:
length(Boston$chas[Boston$chas == 1])


35

So 35 suburbs bound the Charles River.
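Because chas is a 0/1 dummy variable, the same count can be sketched in two shorter ways. The name n_river is illustrative.

```r
library(MASS)

# Summing a logical vector counts the TRUE values
n_river <- sum(Boston$chas == 1)
print(n_river)

# table() shows both groups at once: 0 = does not bound the river, 1 = bounds it
print(table(Boston$chas))
```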


10(f) What is the median pupil teacher ratio for the towns in this dataset?


In [29]:
median(Boston$ptratio)


19.05

10 (g) Let us now find which suburb of Boston has the lowest median value of owner-occupied homes.


In [30]:
index = which.min(Boston$medv) # index of the first row with the minimum medv
Boston[index,] # access this row


row     crim  zn  indus  chas    nox     rm  age     dis  rad  tax  ptratio  black  lstat  medv
399  38.3518   0   18.1     0  0.693  5.453  100  1.4896   24  666     20.2  396.9  30.59     5

So the 399th suburb in the dataset has the lowest median value of owner-occupied homes (medv = 5). Let us see the nature of the other fields.
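One caveat before moving on: which.min reports only the first position at which the minimum occurs, so any other suburb with the same medv would go unnoticed. A minimal sketch for checking ties, with illustrative names lowest and tied_rows:

```r
library(MASS)

lowest <- min(Boston$medv)

# All row indices that share the minimum, not just the first one
tied_rows <- which(Boston$medv == lowest)
print(tied_rows)
```

If more than one index is printed, the analysis that follows describes only the first of those suburbs.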


In [31]:
percentile = ecdf(Boston$crim) # ecdf takes a vector and returns a function that computes the percentile of a value
print(paste("Crime Rate = ", percentile(Boston[index,'crim']))) # percentile of this suburb's crime rate


[1] "Crime Rate =  0.988142292490119"
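To make the meaning of this number concrete: the function returned by ecdf gives the fraction of observations less than or equal to its argument. The sketch below checks that against a direct computation with mean; the names value, via_ecdf and via_mean are illustrative.

```r
library(MASS)

index <- which.min(Boston$medv)
value <- Boston[index, "crim"]   # crime rate of the lowest-medv suburb

via_ecdf <- ecdf(Boston$crim)(value)      # empirical CDF evaluated at that value
via_mean <- mean(Boston$crim <= value)    # share of suburbs with crim <= value

print(c(ecdf = via_ecdf, mean = via_mean))
```

Both give about 0.988, meaning roughly 98.8% of suburbs have a crime rate no higher than this one.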

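Let us iterate over all fields to get the big picture.


In [32]:
for (field in names(Boston)) {
    percentile = ecdf(Boston[[field]]) # empirical CDF of this field over all suburbs
    print(paste(field, " = ", percentile(Boston[index, field]))) # evaluated at this suburb's own value of the field
}


Each printed value is the fraction of suburbs whose value for that field is less than or equal to this suburb's value. Read together with the row shown above and the quartiles reported by summary(Boston) in part (h), the suburb with the lowest median value of owner-occupied homes also has:

  • One of the highest crime rates (crim = 38.35, roughly the 99th percentile).
  • No residential land zoned for large lots (zn = 0, the dataset minimum).
  • A high proportion of non-retail business acres per town (indus = 18.1, equal to the third quartile).
  • No boundary on the Charles River (chas = 0).
  • A high nitrogen oxide concentration (nox = 0.693, above the third quartile of 0.624).
  • Fewer rooms per dwelling than most suburbs (rm = 5.453, below the first quartile of 5.886).
  • The oldest housing stock (age = 100, the dataset maximum).
  • A short distance to employment centers (dis = 1.4896, below the first quartile of 2.100).
  • The highest index of accessibility to radial highways (rad = 24, the dataset maximum).
  • A high tax rate (tax = 666, equal to the third quartile).
  • A high pupil-teacher ratio (ptratio = 20.2, equal to the third quartile).
  • The maximum value of the black index (black = 396.9).
  • One of the highest percentages of lower-status population (lstat = 30.59, far above the third quartile of 16.95).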
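A compact alternative for computing every field's percentile for this suburb is sketched below. It uses mean(x <= v), which equals the empirical CDF at v, and evaluates each field at the suburb's own value for that field. The name pct is illustrative.

```r
library(MASS)

index <- which.min(Boston$medv)

# For each column: share of suburbs whose value is <= this suburb's value
pct <- sapply(names(Boston), function(field) {
  mean(Boston[[field]] <= Boston[index, field])
})

print(round(pct, 3))
```

A value of 1 means the suburb sits at the dataset maximum for that field, while values near 0 or 1 mark the fields on which it is most extreme.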

10 (h) Let us find the number of suburbs that average over 7 rooms per dwelling.


In [34]:
length(Boston$rm[Boston$rm > 7])


64

So 64 suburbs average more than 7 rooms per dwelling. Now let us see how many suburbs exceed 8.


In [35]:
length(Boston$rm[Boston$rm > 8])


13

13 suburbs in our dataset average more than 8 rooms per dwelling. Let us see what kind of suburbs these are.


In [36]:
Boston[Boston$rm > 8,]


row     crim  zn  indus  chas     nox     rm   age     dis  rad  tax  ptratio   black  lstat  medv
98   0.12083   0   2.89     0  0.4450  8.069  76.0  3.4952    2  276     18.0  396.90   4.21  38.7
164  1.51902   0  19.58     1  0.6050  8.375  93.9  2.1620    5  403     14.7  388.45   3.32  50.0
205  0.02009  95   2.68     0  0.4161  8.034  31.9  5.1180    4  224     14.7  390.55   2.88  50.0
225  0.31533   0   6.20     0  0.5040  8.266  78.3  2.8944    8  307     17.4  385.05   4.14  44.8
226  0.52693   0   6.20     0  0.5040  8.725  83.0  2.8944    8  307     17.4  382.00   4.63  50.0
227  0.38214   0   6.20     0  0.5040  8.040  86.5  3.2157    8  307     17.4  387.38   3.13  37.6
233  0.57529   0   6.20     0  0.5070  8.337  73.3  3.8384    8  307     17.4  385.91   2.47  41.7
234  0.33147   0   6.20     0  0.5070  8.247  70.4  3.6519    8  307     17.4  378.95   3.95  48.3
254  0.36894  22   5.86     0  0.4310  8.259   8.4  8.9067    7  330     19.1  396.90   3.54  42.8
258  0.61154  20   3.97     0  0.6470  8.704  86.9  1.8010    5  264     13.0  389.70   5.12  50.0
263  0.52014  20   3.97     0  0.6470  8.398  91.5  2.2885    5  264     13.0  386.86   5.91  48.8
268  0.57834  20   3.97     0  0.5750  8.297  67.0  2.4216    5  264     13.0  384.54   7.44  50.0
365  3.47428   0  18.10     1  0.7180  8.780  82.9  1.9047   24  666     20.2  354.55   5.29  21.9

In [37]:
summary( Boston[Boston$rm > 8,] )


      crim               zn            indus             chas       
 Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
 1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
 Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
 Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
 3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
 Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
      nox               rm             age             dis       
 Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
 1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
 Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
 Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
 3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
 Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
      rad              tax           ptratio          black      
 Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :354.6  
 1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:384.5  
 Median : 7.000   Median :307.0   Median :17.40   Median :386.9  
 Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :385.2  
 3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:389.7  
 Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :396.9  
     lstat           medv     
 Min.   :2.47   Min.   :21.9  
 1st Qu.:3.32   1st Qu.:41.7  
 Median :4.14   Median :48.3  
 Mean   :4.31   Mean   :44.2  
 3rd Qu.:5.12   3rd Qu.:50.0  
 Max.   :7.44   Max.   :50.0  

Let us try to compare these stats to those of the entire dataset.


In [38]:
summary(Boston)


      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  

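Some noticeable differences for the suburbs averaging more than 8 rooms per dwelling:

  • Crime rates (crim) are much lower: the maximum is 3.47, compared with 88.98 for the whole dataset.
  • The percentage of lower-status population (lstat) is much lower: the maximum is 7.44, compared with 37.97.
  • Home values (medv) are much higher: the median is 48.3, compared with 21.2.
  • The other fields appear broadly similar to those of the whole dataset.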
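Comparing two long summary printouts by eye is tedious, so the medians of the two groups can be placed side by side. This is a minimal sketch; the names large and comparison are illustrative, and the median is chosen here only because it is less sensitive to the heavy skew in fields such as crim.

```r
library(MASS)

large <- Boston[Boston$rm > 8, ]   # suburbs averaging more than 8 rooms

# One row per group, one column per field
comparison <- rbind(
  rm_gt_8 = sapply(large, median),
  all     = sapply(Boston, median)
)

print(round(comparison, 2))
```

The columns where the two rows differ most, notably lstat and medv, are the ones highlighted in the list above.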